Entropy-Based Authorship Search in Large Document Collections

نویسندگان

  • Ying Zhao
  • Justin Zobel
چکیده

The purpose of authorship search is to identify documents written by a particular author in large document collections. Standard search engines match documents to queries based on topic, and are not applicable to authorship search. In this paper we propose an approach to authorship search based on information theory. We propose relative entropy of style markers for ranking, inspired by the language models used in information retrieval. Our experiments on collections of newswire texts show that, with simple style markers and sufficient training data, documents by a particular author can be accurately found from within large collections. Although effectiveness does degrade as collection size is increased, with even 500,000 documents nearly half of the top-ranked documents are correct matches. We have also found that the authorship search approach can be used for authorship attribution, and is much more scalable than state-of-art approaches in terms of the collection size and the number of candidate authors.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Authoritative Re-Ranking in Fusing Authorship-Based Subcollection Search Results

We examine the use of authorship information to divide IR test collections into subcollections and apply techniques from the field of distributed information retrieval to enhance the baseline search results. We determine the expertise of each author, based on the content of their documents, and use this knowledge to construct rankings of the different author subcollections for each query. We go...

متن کامل

Author Identification on the Large Scale

Individuals have distinctive ways of speaking and writing, and there exists a long history of linguistic and stylistic investigation into authorship attribution. In recent years, practical applications for authorship attribution have grown in areas such as intelligence (linking intercepted messages to each other and to known terrorists), criminal law (identifying writers of ransom notes and har...

متن کامل

An Analysis of Ministry of Education’s Strategic Plans Based on Favorable Components of English Language Teaching Using Shannon’s Entropy

The present research aims to analyze the content of Ministry of Education’s strategic plans (the Fundamental Reform Document of Education, the Comprehensive National Scientific Plan and the National Curriculum Document) based on Shannon's entropy regarding the favorable components of teaching English. The contents of the Fundamental Reform Document of Education, the Comprehensive National Scien...

متن کامل

The Reliability of Metrics Based on Graded Relevance

Improving weak ad-hoc retrieval by Web assistance and data fusion p. 17 Query expansion with the minimum relevance judgments p. 31 Improved concurrency control technique with lock-free querying for multi-dimensional index structure p. 43 A color-based image retrieval method using color distribution and common bitmap p. 56 A probabilistic model for music recommendation considering audio features...

متن کامل

Retrieval from Document Image Collections

This paper presents a system for retrieval of relevant documents from large document image collections. We achieve effective search and retrieval from a large collection of printed document images by matching image features at word-level. For representations of the words, profile-based and shape-based features are employed. A novel DTWbased partial matching scheme is employed to take care of mo...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2007